This is probably one of the most important posts on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying to do conceptual alignment, to read this and think about it deeply.
What this gives us is a way of combining the outputs of many disparate epistemic strategies to get well-structured and directly relevant knowledge about alignment and how our proposals would fare. This is great, because now we can combine many different methods of investigation (theoretical arguments, philosophical approaches, empirical studies of analogous systems and problems) and try to tie them to a common narrative (pun intended) about alignment.
Of course, we should expect that some things we want to learn about won't fit neatly in there, but training stories are still surprisingly inclusive. For example, we could expect that reasoning about potential problems of AGI, in the very conceptual/philosophical/theoretical way we favor on the AF, wouldn't fit a framework focused on justifying a given approach. Yet training stories also include the probing of their rationales, and finding a new problem/issue allows new probing and refinement, much like the very theoretical computer science model presented by Paul in his research methodology post.
There is indeed one thing this post doesn't get into: exactly which epistemic strategies we can and should use to argue for each part of a training story, and to break and falsify each part. Still, I find that having a framing for combining and linking the outputs of existing and new epistemic strategies is already quite an accomplishment. Plus, it leaves me some work to do on clarifying and distilling the epistemic strategies of alignment.
Last but not least, I really like the name "story" for two reasons:
This is probably one of the most important posts on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying to do conceptual alignment, to read this and think about it deeply.
Glad you think so! I definitely agree and am planning on using this framework in my own research going forward.
"story" makes technical people feel uncomfortable. We immediately fear weird justification and biases towards believing interesting stories. And we should be wary of this when working on alignment, while acknowledging that most of our knowledge will take a form like that. So the word reminds us daily to not feel too comfortable with our ideas and intuitions, as we always risk falling for our own inventions.
Yep, this is definitely intentional. I think in many ways just thinking about inner alignment as avoiding proxy-aligned mesa-optimizers can give you false confidence in your training story, because you reason “of course I won't get that specific failure mode”—but the problem is that you need to couple some reason that you won't get the wrong thing with some strong reason that you actually will get the right thing to really be confident in your training process's safety.
I found myself coming back to this now, years later, and feeling like it is massively underrated. Idk, it seems like the concept of training stories is great and much better than e.g. "we have to solve inner alignment and also outer alignment" or "we just have to make sure it isn't scheming."
Anyone -- and in particular Evhub -- have updated views on this post with the benefit of hindsight? Should we e.g. try to get model cards to include training stories?
Anyone -- and in particular Evhub -- have updated views on this post with the benefit of hindsight?
I intuitively don't like this approach, but I have trouble articulating exactly why. I've tried to explain a bit in this comment, but I don't think I'm quite saying the right thing.
One issue I have is that it doesn't seem to nicely handle interactions between the properties of the AI and how it's used. You can have an AI which is safe when used in some ways, but not in others. This could be due to approaches like control (which mostly route around mechanistic properties of the AI), but also potentially things like using monitoring ensembles to handle lack of robustness, or paying AIs rather than aligning them.
Another problem I have is that this doesn't very naturally incorporate various non-mechanistic analyses targeting specific threat models, which IMO should be (and will be) very central. E.g., we built a wide variety of model organisms which are closely analogous to our training and deployment environment and which aim to uncover potential reward hacking failure modes, and these model organisms didn't demonstrate any issues. Same for things like adversarially testing for clear misalignment: it doesn't result in a mechanistic model, but feels very central.
To be clear, I think all the things I discussed above can be discussed in this framework, but it feels quite unnatural and the decomposition doesn't seem like it's doing any work.
I think the type of mechanistic analysis proposed here seems quite aspirational given the current state of technology, such that it feels odd to center it. Or the mechanistic analysis you do will apply to all training runs and no safety interventions will affect it, such that it's more like useful background than a key part of analyzing different safety measures. To be clear, we will want to do some mechanistic analysis and have some space of mechanistic hypotheses. But this feels more like the background threat model than the core safety case, due to difficulties in testing. We can also somewhat test these mechanistic hypotheses with experiments that don't require huge technological breakthroughs, but this seems more like an important sub-component of a safety case than the main thing.
Perhaps Evan thinks we're totally screwed (or at least can't obtain high confidence) without strong mechanistic analysis, such that centering it is good. I think whether we can obtain high confidence is unclear, and I disagree with "totally screwed". It's possible that my views here partially come down to a difference of opinion with Evan, where he thinks that deceptive alignment is very likely given usage of models capable of powerful goal-oriented behavior, whereas I think this is uncertain. Further, I think it's reasonably likely (perhaps 1/3) that I'll end up being very confident that deceptive alignment is very unlikely at the point when we have powerful AIs (due to experiments and further conceptual reasoning).
More generally, I feel like the way I currently talk and think about safety cases and similar topics doesn't fit nicely into training stories. I think the way I currently do it is better, but I'm not entirely certain, and I haven't tried the training stories approach much.
I should also note that a general approach like training stories seems much better than a decomposition like "inner alignment" vs "outer alignment", which presupposes a particular approach to solving the problem. (I do think that "inner misalignment" vs "outer misalignment" is a reasonable decomposition of threat models for AIs produced with ML. But these are threat models, not problems to be solved, and there are many routes to solving them. See here for more discussion.)
I think I prefer the default trajectory of safety cases and RSPs over what would happen with additional emphasis on training stories, but I'm uncertain.
Thanks to Rohin Shah, Ajeya Cotra, Richard Ngo, Paul Christiano, Jon Uesato, Kate Woolverton, Beth Barnes, and William Saunders for helpful comments and feedback.
Evaluating proposals for building safe advanced AI—and actually building any degree of confidence in their safety or lack thereof—is extremely difficult. Previously, in “An overview of 11 proposals for building safe advanced AI,” I tried evaluating such proposals on the axes of outer alignment, inner alignment, training competitiveness, and performance competitiveness. While I think that those criteria were good for posing open questions, they didn’t lend themselves well to actually helping us understand what assumptions needed to hold for any particular proposal to work. Furthermore, if you’ve read that paper/post, you’ll notice that those evaluation criteria don’t even work for some of the proposals on that list, most notably Microscope AI and STEM AI, which aren’t trying to be outer aligned and don’t really have a coherent notion of inner alignment either.
Thus, I think we need a better alternative for evaluating such proposals—and actually helping us figure out what needs to be true for us to be confident in them—and I want to try to offer it in the form of training stories. My hope is that training stories will provide:
What’s a training story?
When you train a neural network, you don’t have direct control over what algorithm that network ends up implementing. You do get to incentivize it to have some particular behavior over the training data, so you might say “whatever algorithm it’s implementing, it has to be one that’s good at predicting webtext”—but that doesn’t tell you how your model is going to go about accomplishing that task. But exactly how your model learns to accomplish the task that you give it matters quite a lot, since that’s what determines how your model is going to generalize to new data—which is precisely where most of the safety concerns are. A training story is a story of how you think training is going to go and what sort of model you think you’re going to get at the end, as a way of explaining how you’re planning on dealing with that very fundamental question of how your model is going to learn to accomplish the task that you give it.
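To make that concrete, here is a minimal sketch of a standard supervised training step (PyTorch, with made-up shapes and data): the loss only ever sees the model's outputs on the training batch, so nothing in the code below constrains which internal algorithm the network ends up implementing.

```python
# Minimal sketch (PyTorch, made-up shapes): the loss below only ever touches
# the model's *outputs* on the training batch -- nothing in this loop
# constrains which internal algorithm the network learns in order to produce
# those outputs.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    logits = model(images)          # behavior on the training batch...
    loss = loss_fn(logits, labels)  # ...is all the training signal can see
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. train_step(torch.randn(8, 3, 32, 32), torch.randint(0, 2, (8,)))
```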
Let’s consider cat classification as an example. Right now, if you asked a machine learning researcher what their goal is in training a cat classifier, they’d probably say something like “we want to train a model that distinguishes cats from non-cats.” The problem with that sort of a training story, however, is that it only describes the desired behavior for the model to have, not the desired mechanism for how the model might achieve that behavior. Instead of such “behavioral training stories,” for the rest of the post when I say “training story,” I want to specifically reference mechanistic training stories—stories of how training goes in terms of what sort of algorithm the model you get at the end is implementing, not just behaviorally what your model does on the training distribution. For example, a mechanistic training story for cat classification might look like:
I think that there are a bunch of things that are nice about the above story. First, if the above story is true, it’s sufficient for safety—it precisely describes a story for how training is supposed to go such that the resulting model is safe. Furthermore, such a story makes pretty explicit what could go wrong such that the resulting model wouldn’t be safe—in this case, if the simplest cat-detecting neural network was an agent or an optimization process that terminally valued distinguishing cats from non-cats. I think that explicitly stating what assumptions are being made about what model you’re going to get is important, since at some point you could get an agent/optimizer rather than just a bunch of heuristics.[1]
Second, such a story is highly falsifiable—in fact, as we now know from work like Ilyas et al.’s “Adversarial Examples Are Not Bugs, They Are Features,” the sorts of cat-detection heuristics that neural networks generally learn are often not very human-like. Of course, I picked this story explicitly because it made plausible claims that we can now actually falsify. Though every training story should have to make falsifiable claims about what mechanistically the model should be doing, those claims could be quite difficult in general to falsify, as our ability to understand anything about what our models are doing mechanistically is quite limited. While this might seem like a failure of training stories, in some sense I think it’s also a strength, as it explicitly makes clear the importance of better tools for analyzing/falsifying facts about what our models are doing.
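As one concrete illustration of the kind of behavioral falsification that's already possible today, here is a rough sketch (not from the original post) of an FGSM-style adversarial probe; `model` and `loss_fn` are assumed to be an ordinary differentiable classifier and its loss. If tiny perturbations like this reliably flip the classifier's answers, the "human-like heuristics" claim in the story above is falsified.

```python
import torch

def fgsm_perturb(model, x, y, loss_fn, epsilon=0.01):
    """Return x plus a small perturbation chosen to increase the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # Nudge every pixel slightly in the direction that most increases the loss.
    return (x + epsilon * x.grad.sign()).detach()

# If model(x_adv).argmax(-1) frequently differs from model(x).argmax(-1) even
# for imperceptibly small epsilon, the classifier is not relying on the kind of
# human-recognizable cat features the training story claimed.
```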
Third, training stories like the above can be formulated for essentially any situation where you’re trying to train a model to accomplish a task—not only are training stories useful for complex alignment proposals, as we’ll see later, but they also apply even to simple cat detection, as in the story above. In fact, though it’s what I primarily want them for, I don’t think that there’s any reason that training stories need to be exclusively for large/advanced/general/transformative AI projects. In my opinion, any AI project that has cause to be concerned about risks/dangers should have a training story. Furthermore, since I think it will likely get difficult to tell in the future whether there should be such cause for concern, I think that the world would be a much better place if every AI project—e.g. every NeurIPS paper—said what their training story was.
Training story components
To help facilitate the creation of good training stories, I'm going to propose that every training story have at least the following two basic parts: a training goal (a mechanistic description of the sort of model that you're hoping your training process will produce) and a training rationale (a reason to believe that your training process will actually produce a model satisfying that training goal).
Note that there is some tension in the above notion of a training goal, which is that, if you have to know from a mechanistic/algorithmic perspective exactly what you want your model to be doing, then what's the point of using machine learning if you could just implement that algorithm yourself? The answer to this tension is that the training goal doesn't need to be quite that precise—but exactly how precise it should be is a tricky question that I'll go into in more detail in the next section.
For now, within the above two basic parts, I want to break each down into two pieces, giving us the full four components that I think any training story needs to have:
As an example of applying these components, let’s reformulate the cat detection training story using these four basic components:
How mechanistic does a training goal need to be?
One potential difficulty in formulating training goals as described above is determining what to specify and what to leave unspecified in the training goal specification. Specify too little and your training goal specification won’t be constraining enough to ensure that any model that meets it is desirable—but specify too much, and why are you even using machine learning in the first place if you already know precisely what algorithm you want the resulting model to implement?
In practice, I think it's always a good idea to be as precise as you can—so the real question is, how precise do you need to be for a description to work well as a training goal specification? Fundamentally, there are two constraining factors: the first is training goal desirability—the more precise your training goal, the easier it is to argue that any model that meets it is desirable—and the second is the training rationale—how hard it is actually going to be in practice to ensure that you get that specific training goal.
Though it might seem like these two factors are pushing in opposite directions—training goal desirability towards a more precise goal and the difficulty of formulating a training rationale towards a more general goal—I think that's actually not true. Formulating a good training rationale can often be much easier for a more precise training goal. For example, if your training goal is “a safe model,” that's a very broad goal, but an extremely difficult one to ensure that you actually achieve. In fact, I would argue, creating a training rationale for the training goal of “a safe model” is likely to require putting an entire additional training story in your training rationale, as you've effectively gone down a level without actually reducing the original problem at all. What actually makes a training goal specification easier to build a training rationale for, in my opinion, isn't generality, but rather things like how natural the goal is in terms of the inductive biases of the training process, how much it corresponds to aspects of the model that we know how to look for, how easily it can be broken down into individually checkable pieces, etc.
As a concrete example of how precise a training goal should be, I’m going to compare two different ways in which Paul Christiano has described a type of model that he’d like to build.[2] First, consider how Paul describes corrigibility:
In my opinion, a description like the above would do very poorly as a training goal specification. Though Paul’s description of corrigibility specifies a bunch of things that a corrigible model should do, it doesn’t describe them in a way that actually pins down how the model should do those things. Thus, if you try to just build a training rationale for how to get something like the above, I think you’re likely to just get stuck on what sort of model you could try to train that, in the broad space of possible models, would actually have those properties.
Now, compare Paul’s description of corrigibility above to Paul’s description of the “intended model” in “Teaching ML to answer questions honestly instead of predicting human answers:”
Paul’s first paragraph here can clearly be interpreted as a training goal specification with the latter two paragraphs being training goal desirability—and in this case I think this is exactly what a training goal should look like. Paul describes a specific mechanism for how the intended model works—using an honest mapping from its internal world-model to natural language—and explains why such a model would work well and what might go wrong if you instead got something that didn’t quite match that description. In this case, I don’t think that Paul’s training goal specification above would actually work for training a competitive system—and Paul doesn’t intend it that way—but nevertheless, I think it’s a good example of what I think a mechanistic training goal should look like.
Looking forward, I’d like to be able to develop training goals that are even more specific and mechanistic than Paul’s “intended model.” Primarily, that’s because the more specific/mechanistic we can get our training goals, the more room that we should eventually have for failure in our training rationales—if a training goal is very specific, then even if we miss it slightly, we should hopefully still end up in a safe part of the overall model space. Ideally, as I discuss later, I’d like to have rigorous sensitivity analyses of things like “if the training rationale is slightly wrong in this way, by how much do we miss the training goal”—but getting there is going to require both more specific/mechanistic training goals as well as a much better understanding of when training rationales can fail. For now, though, I’d like to set the bar for “how mechanistic/precise should a training goal specification be” to “at least as mechanistic/precise as Paul’s description above.”
Relationship to inner alignment
The point of training stories is not to do away with concepts like mesa-optimization, inner alignment, or objective misgeneralization. Rather, the point of training stories is to provide a universal framework in which all of those sorts of concepts can live as discrete subproblems—specific ways in which a training story might go wrong.
Thus, here’s my training-stories-centric glossary of many of these other terms that you might encounter around AI safety:
It’s worth pointing out how phrasing inner and outer alignment in terms of training stories makes clear what I think was our biggest mistake in formulating that terminology, which is that inner/outer alignment presumes that the right way to build an aligned model is to find an aligned loss function and then have a training goal of finding a model that optimizes for that loss function. However, as I hope the more general framework of training stories should make clear, there are many possible ways of trying to train an aligned model. Microscope AI and STEM AI are examples that I mentioned previously, but in general any approach that intends to use a loss function that would be problematic if directly optimized for, but then attempts to train a model that doesn’t directly optimize for that loss function, would fail on both outer and inner alignment—and yet might still result in an aligned model.
One of my hopes with training stories is that it will help us better think about approaches in the broader space that Microscope AI and STEM AI operate in, rather than just feeling constrained to approaches that fit nicely within the paradigm of inner alignment.
Do training stories capture all possible ways of addressing AI safety?
Though training stories are meant to be a very general framework—more general than outer/inner alignment, for example—there are still approaches to AI safety that aren’t covered by training stories. For example:
Evaluating proposals for building safe advanced AI
Though I’ve described how I think training stories should be constructed—that is, using the four components I detailed previously—I haven’t explained how I think training stories should be evaluated.
Thus, I want to introduce the following four criteria for evaluating a training story to build safe advanced AI. These criteria are based on the criteria I used in “An overview of 11 proposals for building safe advanced AI,” but adapted for the training stories setting. Note that these criteria should only be used for proposals for advanced/transformative/general AI, not just any AI project. Though I think that the general training stories framework is applicable to any AI project, these specific evaluation criteria are only for proposals for building advanced AI systems.
Training goal …
Training rationale …
Case study: Microscope AI
In this section, I want to take a look at a particular concrete proposal for building safe advanced AI that I think is hard to evaluate properly without training stories, and show that, with training stories, we can easily make sense of what it’s trying to do and how it might or might not succeed.
That proposal is Chris Olah’s Microscope AI. Here’s my rendition of a training story for Microscope AI:
Now, we’ll try to evaluate Microscope AI’s training story using our four criteria from above:
Training goal …
… alignment: Training goal alignment for Microscope AI might seem trivial, as it seems like the training goal of a purely predictive model just shouldn’t be dangerous.
However, there are potential safety issues even with purely predictive models—in particular, once a predictor starts predicting a world that involves itself, it runs into self-reference problems that might have multiple fixed points, some of which could be quite bad. For example: a pure predictor might predict that the world will be destroyed and replaced by a new, very easy-to-predict world in such a way that causes precisely that to happen. Exactly that scenario would likely require the predictor to be choosing its predictions to optimize the world to be easy to predict, which might be ruled out by the training goal (depending on exactly how it’s specified), but the general problem of how a predictor should handle self-fulfilling prophecies remains regardless. Though the training goal that I gave previously enforces that the model not be “reasoning about the effects of its predictions on the world,” exactly how to do that, given that its predictions are in fact a part of the world, is non-trivial. For more detail on this sort of scenario, see Abram Demski’s “The Parable of Predict-O-Matic.”
Fortunately for Microscope AI, however, such a scenario should hopefully be off the table, as the goal isn’t to ever actually use the model’s predictions, but rather just to extract the concepts the model is using to make its predictions—and since the model is just trained on prediction, it shouldn’t ever have to predict anything that depends on its own predictions (though this depends on the variant of self-supervised training being used). However, it’s unclear whether that’s enough to fully address such concerns—the model’s internal concepts can be thought of as a type of prediction/output, and if the model is actively selecting those concepts to achieve a particular result, as above, that could be quite bad. Thus, for Microscope AI to succeed on training goal alignment, the training goal specification likely needs to be quite firm in ruling out all possible such optimization over the world.
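As a toy illustration of the multiple-fixed-point worry (my own made-up numbers, not part of the original proposal), consider a predictor whose published forecast itself influences the outcome it's forecasting:

```python
# Toy model (numbers invented for illustration): the chance of a bank run
# depends on the predictor's published forecast of a bank run.
def world(p):
    """Probability of a bank run, given a published forecast p of a bank run."""
    return 0.9 if p > 0.5 else 0.1

for p in [0.1, 0.9]:
    print(f"forecast {p} -> outcome {world(p)}; self-consistent: {world(p) == p}")

# Both 0.1 and 0.9 are perfectly "accurate" self-fulfilling forecasts, so a
# pure accuracy criterion cannot distinguish the benign fixed point from the
# catastrophic one -- the predictor's choice between them is doing real work.
```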
… competitiveness: There are a couple of major factors to address regarding training goal competitiveness for Microscope AI.
First is whether interpreting the training goal would actually meaningfully enhance human understanding. For that to be the case, transparency and interpretability tools would have to be capable of extracting useful knowledge that humans can understand but currently don’t. For example, if transparency and interpretability tools were limited to just determining whether some piece of information exists in a model, Microscope AI would be useless, as only information that humans already knew to look for could be extracted. Furthermore, the information extracted has to actually be information that’s useful to humans in making the sorts of decisions that humans want to make effectively—which could be especially difficult given that the actual information in the model is only ever optimized for prediction, not for making decisions, which could end up making it not very useful for the process of decision-making.
Second, for Microscope AI to be training goal competitive, enhanced human understanding alone—without the ability to build any agentic systems—has to be sufficient for the economic use cases where one might otherwise want highly agentic advanced AI (e.g. an AGI). From “An overview of 11 proposals:”
Training rationale …
… alignment: In discussing training goal alignment, we established that the training goal needs to quite firmly rule out any optimization over the world. This creates a difficult challenge for the training rationale, however, as enforcing the absence of optimization over the world could be quite difficult.
In particular, predicting a world that involves optimization might require optimization, potentially pushing any predictive model towards becoming an agent. For instance, any world model which needs to be able to answer queries of the form “what is the probability that a is the optimal action according to utility u” or “what is the best action for agent A to take in state x” would likely have to implicitly be running some sort of an optimization algorithm. Given that the world does in fact contain lots of optimizers—humans, for example—being able to handle such queries seems likely to be necessary to be able to efficiently predict the world. Thus, enforcing something like “pure prediction” could be quite difficult to do while also maintaining accuracy on questions involving humans. Furthermore, even if such enforcement could be done, it seems like it would put a very large burden on the transparency tools being used to do that enforcement, as the training process itself would likely be fighting against the enforcement mechanism—since just using optimization would still be a good way for the training process to modify the model to predict the data well. This could be especially concerning if it allows the model to start performing optimization and become deceptive before that optimization can be detected.
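A minimal sketch of why such queries smuggle in optimization; `actions` and `utility` here are just stand-ins for whatever the learned world model represents internally:

```python
def best_action_query(state, actions, utility):
    # The "predictor" answers the query by searching over actions -- i.e., it
    # contains an optimizer, whether or not we ever intended to train one.
    return max(actions, key=lambda a: utility(state, a))

def prob_action_is_optimal(state, action, actions, utility):
    return 1.0 if action == best_action_query(state, actions, utility) else 0.0
```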
Furthermore, the training goal also requires that the model not just be a pure predictor, but also be using human-level concepts to do its prediction. While it might seem like this would just be falsified in the same way as the cat detection training story, Chris Olah argues that more powerful models should actually be more likely to use human-level concepts, at least up until those models get human-level capabilities. Under such a worldview, we only see models learning non-human abstractions because we are currently operating in a “valley of confused abstractions,” but once we get human-level models, they’ll learn human-level concepts. Worryingly, however, Chris also predicts that, as our models then surpass human-level, we’ll start to get “increasingly alien abstractions,” which could again cause problems for Microscope AI.
… competitiveness: Training rationale competitiveness is one area where Microscope AI generally does pretty well, as self-supervised learning is something we already know how to do and do efficiently. The biggest potential training rationale competitiveness issue, however, would be if the use of transparency tools during training to enforce the training goal—e.g. to check for optimization—significantly slowed down the training process or were otherwise too expensive. For example—if it’s necessary for humans to use transparency tools to fully reevaluate the model at each training step, that could end up being pretty uncompetitive. As such, it seems likely that we’ll need at least some progress in automated transparency to make Microscope AI’s training rationale competitive.
Compared to my previous analysis of Microscope AI, I think that this version is much clearer, easier to evaluate, and makes it easier to locate concrete open problems. For example, rather than my previous outer alignment analysis that simply stated that Microscope AI wasn't outer aligned and wasn't trying to be, we now have a very clear idea of what it is trying to be and an evaluation of that specific goal.
Exploring the landscape of possible training stories
Though I like the above Microscope AI example for showcasing one particular training story for building safe advanced AI and how it can be evaluated, I also want to spend some time looking into the broader space of all possible training stories. To do that, I want to look at some of the broad classes that training goals and training rationales can fall into other than the ones that we just saw with Microscope AI. By no means should anything here be considered a complete list, however—in fact, my sense is that we're currently only scratching the surface of all possible types of training goals and rationales.
We’ll start with some possible broad classes of training goals.
All of the above ideas are exclusively training goals, however—for any of them to be made into a full training story, they’d need to be combined with some specific training rationale for how to achieve them. Thus, I also want to explore what some possible classes of training rationales might look like. Remember that a training rationale isn’t just a description of what will be done to train the model—so you won’t see anything like “do RL” or even “do recursive reward modeling” on this list—rather, a training rationale is a story for how/why some approach like that will actually succeed.
Capability limitations: One somewhat obvious training rationale—but one that I think is nevertheless worth calling attention to, as I think it can often be quite useful—is analyzing whether a model would actually have the capabilities to do any sort of bad/undesirable thing. For example, many current systems may simply not have the model capacity to learn the sorts of algorithms—e.g. optimization algorithms—that might be dangerous. To make these sorts of training rationales maximally concrete and falsifiable, I think a good way to formulate a training rationale of this form is to isolate a particular sort of capability that is believed to be necessary for a particular type of undesirable behavior and combine that with whatever evidence there is for why a model produced by the given training process wouldn't have that capability. For example, if the ability to understand how to deceive humans is a necessary capability for deception, then determining that such a capability would be absent could serve as a good training rationale for why deception wouldn't occur. Unfortunately, current large language models seem to be capable of understanding how to deceive humans, making that specific example insufficient.
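As a hedged sketch of how such a rationale might be made concrete and falsifiable, one could define an explicit capability eval for the precursor and require the model to score below a threshold; everything here (`model`, `eval_items`, `grade_answer`) is a hypothetical placeholder rather than a real benchmark:

```python
def passes_capability_bar(model, eval_items, grade_answer, threshold=0.2):
    """Hypothetical check that a precursor capability is absent.

    eval_items: prompts probing, e.g., understanding of how to mislead someone.
    grade_answer: scores how capable the model's answer looks (0 to 1).
    """
    scores = [grade_answer(item, model(item.prompt)) for item in eval_items]
    mean_score = sum(scores) / len(scores)
    # The capability-limitation rationale only goes through if the score is low
    # -- and, as noted above, current LLMs would likely fail this particular bar.
    return mean_score < threshold
```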
Inductive bias analysis: Inductive bias analysis is the approach of attempting to carefully understand the inductive biases of a training process enough to be able to predict what sort of model will be learned. For example, any approach which attempts to predict what the “simplest” model will be given some training procedure and dataset is relying on inductive bias analysis—as in both the cat detection and Microscope AI training stories that we’ve seen previously.
Inductive bias analysis is a very tempting approach, as it allows us to essentially just do standard machine learning and have a good idea of what sort of model it’ll produce. Unfortunately, once you start being very careful about your inductive bias analysis and working everything out mathematically—as in “Answering questions honestly instead of predicting human answers: lots of problems and some solutions”—it starts to get very tricky and very difficult to do successfully. This is especially problematic given how inductive bias analysis essentially requires getting everything right before training begins, as a purely inductive-bias-analysis-based training rationale doesn’t provide any mechanism for verifying that the right training goal is actually being learned during training.
Hopefully, however, more results like deep double descent, lottery tickets, scaling laws, grokking, or distributional generalization will help us build better theories of neural network inductive biases and thus become more confident in any inductive-bias-analysis-based training stories.
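One small way to probe inductive biases empirically, as a toy sketch with made-up data: construct a training set that deliberately underdetermines the mechanism (two features carry identical information), then look off-distribution to see which mechanism the training process actually picked.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
x_train = np.stack([y, y], axis=1).astype(float)  # features 0 and 1 both equal the label

model = LogisticRegression().fit(x_train, y)

# Off-distribution probe: decorrelate the two features and see which one the
# trained model actually relies on -- the training data alone can't tell us.
probe = np.array([[1.0, 0.0], [0.0, 1.0]])
print(model.predict_proba(probe))
print(model.coef_)  # here, L2 regularization splits the weight evenly
```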
Transparency and interpretability: As we saw in Microscope AI’s use of transparency tools to check for unwanted optimization/agency, the use of transparency tools during training can be a very useful component of a training rationale, helping to verify that the right sort of algorithm is being learned. Though the training story I gave above for Microscope AI stated that it was primarily relying on inductive bias analysis, an approach that primarily relies on transparency tools would also be a possibility. Even then, however, some inductive bias analysis would likely still be necessary—e.g. “We think that our transparency checks will rule out all simple models that don’t fit the training goal, with all remaining models that don’t fit the goal being too complex according to the inductive biases of the training process to possibly be learned.”
It’s worth noting, however, that all of the above uses of transparency tools rely on worst-case transparency—that is, the ability to actively check for a particular problem anywhere in a model rather than just the ability to understand some particular part of a model—which is something that transparency and interpretability currently still struggles with. Nevertheless, I think that transparency-and-interpretability-based training rationales are some of the most exciting, as unlike inductive bias analysis, they actually provide feedback during training, potentially letting us see problems as they arise rather than having to get everything right in advance.
Automated oversight: One way to significantly enhance the utility of transparency and interpretability tools is to not rely purely on humans being the ones deploying them—both because humans are slow and expensive and because humans are only capable of understanding human-level concepts. Thus, if you expect models to use concepts that are complex, alien, or otherwise difficult for humans to understand—as in the “increasingly alien abstractions” part of Chris Olah's graph of interpretability vs. model strength—then using models that understand those concepts to do the interpretability work could potentially be a good way to ensure that interpretability continues working in such a regime.
Of course, this raises the issue of how to ensure that the models doing the interpretability/oversight are themselves safe. One solution to this problem is to use a form of recursive oversight, in which the overseer model and the model being overseen are the same model, variants of the same model, or otherwise recursively overseeing each other. For a more thorough exploration of what such an approach might look like, see “Relaxed adversarial training for inner alignment.”
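As a rough sketch of the shape of such an approach (placeholder functions throughout, and not a claim about how relaxed adversarial training must be implemented), oversight can be folded into the training objective rather than applied only after the fact:

```python
def oversight_loss(model, overseer, batch, task_loss, penalty_weight=1.0):
    base = task_loss(model, batch)
    # `overseer` might be the same model, a variant of it, or a separate model;
    # how it inspects `model` (activations, weights, behavior) is left abstract.
    acceptability = overseer(model, batch)   # higher = looks more acceptable
    return base + penalty_weight * (1.0 - acceptability)
```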
AI cognitive science: In addition to the "neuroscience approach" of using transparency and interpretability to understand what our models are doing—since transparency is about looking inside models' "brains"—there is also the "cognitive science" approach of proposing theories about what models are doing internally and then testing them via behavioral experiments. An example of this sort of approach would be Deletang et al.'s "Causal Analysis of Agent Behavior for AI Safety," wherein the authors construct causal models of what agents might be doing and then test them through causal interventions.
One problem with this style of approach, however, is that it gives us much less direct information about what algorithm a model might be implementing, as it still relies on purely behavioral information about what the model appears to do, rather than structural information about what the model is doing internally that transparency approaches could hopefully produce. Thus, training rationales based on AI cognitive science approaches might have to rely on some degree of extrapolation from experiments on other, similar models—extrapolation that could have difficulty predicting new problems that only arise with larger/more powerful systems, which is a potential issue for any training rationale primarily based on this sort of an approach.
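Schematically, a cognitive-science-style experiment might look like the following sketch, where `make_env`, `intervene`, and `rollout` are placeholders for a concrete setup:

```python
def behavioral_intervention_test(agent, make_env, intervene, rollout, n=100):
    """Compare behavior with and without intervening on a hypothesized variable."""
    baseline = [rollout(agent, make_env(seed=i)) for i in range(n)]
    intervened = [rollout(agent, intervene(make_env(seed=i))) for i in range(n)]
    # If behavior shifts under the intervention, the hypothesized variable is
    # plausibly implicated in the agent's computation -- but only behaviorally,
    # with no direct look at the underlying algorithm.
    return baseline, intervened
```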
Precursor checking: Another general type of training rationale that I think is worth calling attention to is what I'll call “precursor checking,” which is the concept of using some method of gaining information about a model's internals—e.g. transparency/interpretability or AI cognitive science—to check for some precursor to bad behavior rather than the bad behavior itself. This could involve substituting in some narrower, easier-to-check training goal—that still falls within the broader actual training goal—as the target for the training rationale. For example, if your training rationale involves ensuring that you don't get a deceptive model that's actively trying to trick its training process, then rather than explicitly trying to look for such deception (which could be especially hard since a deceptive model might actively try to avoid detection), you could instead try to ensure that your model has a short horizon length in terms of how far ahead it's planning. Such a plan might work better, since horizon length might be easier to guarantee in a training rationale while still being consistent with the desired training goal and hopefully ruling out the possibility of deception.[4] One issue with this sort of approach, however, is that you have to guarantee that whatever precursor for bad behavior you're looking for is in fact a necessary condition for that bad behavior—if it turns out that there's another way of getting that bad behavior that doesn't go through the precursor, that could be a problem.
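As a purely illustrative sketch of what a precursor check could look like behaviorally (a stand-in, not a proposal from the post), one might probe a model's effective planning horizon by offering trade-offs between immediate and delayed payoffs:

```python
def estimate_horizon(choose, horizons=(1, 5, 25, 125), small=1.0, large=10.0):
    """Longest delay at which the model still prefers the larger, later payoff.

    `choose(delay, now, later)` is a placeholder behavioral probe: True if the
    model picks the payoff `later` delayed by `delay` steps over `now` today.
    """
    longest_preferred = 0
    for k in horizons:
        if choose(delay=k, now=small, later=large):
            longest_preferred = k
    # A short estimated horizon is (weak) evidence against long-horizon
    # scheming -- see the footnote on myopia for why it isn't a guarantee.
    return longest_preferred
```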
Loss landscape analysis: An extension of inductive bias analysis, I think of loss landscape analysis as describing the sort of inductive bias analysis that focuses on the path-dependence of the training process. For example: if you can identify large barriers in the loss landscape, you can potentially use that to narrow down the space of possible trajectories through model space that a training process might take and thus the sorts of models that it might produce. Loss landscape analysis could be especially useful if used in conjunction with precursor checking, since compared to pure inductive bias analysis, loss landscape analysis could help you say more things about what precursors will be learned, not just what final equilibria will be learned. Loss landscape analysis could even be combined with transparency tools or automated oversight to help you artificially create barriers in the loss landscape based on what the overseer/transparency tools are detecting in the model at various points in training.
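One concrete loss-landscape measurement that exists today is evaluating the loss along a straight line in weight space between two checkpoints; a large bump along the path is evidence of a barrier. Here's a minimal sketch, assuming two PyTorch state dicts with matching keys and an `eval_loss(model, data)` helper:

```python
import copy

def loss_along_path(model, state_a, state_b, data, eval_loss, steps=11):
    """Loss at evenly spaced points on the line between two checkpoints."""
    losses = []
    for i in range(steps):
        t = i / (steps - 1)
        interp = {k: (1 - t) * state_a[k] + t * state_b[k] for k in state_a}
        probe = copy.deepcopy(model)
        probe.load_state_dict(interp)
        losses.append(eval_loss(probe, data))
    return losses  # a spike in the middle suggests a barrier in the landscape
```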
Game-theoretic/evolutionary analysis: In the context of a multi-agent training setup, another type of training rationale could be to understand what sorts of models a training process might produce by looking at the game-theoretic equilibria/incentives of the multi-agent setting. One tricky thing with this style of approach, however, is avoiding the assumption that the agents would actually be acting to optimize their given reward functions, since such an assumption is implicitly assuming that you get the training goal of a loss-minimizing model. Instead, such an analysis would need to focus on what sorts of algorithms would tend to be selected for by the emergent multi-agent dynamics in such an environment—a type of analysis that’s perhaps most similar to the sort of analysis done by evolutionary biologists to understand why evolution ends up selecting for particular organisms, suggesting that such evolutionary analysis might be quite useful here. For a more detailed exploration of what a training rationale in this sort of a context might look like, see Richard Ngo’s “Shaping safer goals.”
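As a toy sketch of the evolutionary style of analysis (made-up payoffs), replicator dynamics ask which strategies the population dynamics actually select for, without assuming that any individual agent optimizes its reward:

```python
import numpy as np

def replicator(payoff, x, steps=2000, dt=0.01):
    """Simulate replicator dynamics: strategies grow if they beat the average."""
    for _ in range(steps):
        fitness = payoff @ x
        x = x + dt * x * (fitness - x @ fitness)
        x = np.clip(x, 0.0, None)
        x = x / x.sum()
    return x

# Invented prisoner's-dilemma-like payoffs: strategy 1 exploits strategy 0.
payoff = np.array([[3.0, 0.0],
                   [5.0, 1.0]])
print(replicator(payoff, np.array([0.5, 0.5])))  # population converges to strategy 1
```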
Given such a classification of training rationales, we can label various different AI safety approaches based on what sort of training goal they have in mind and what sort of training rationale they want to use to ensure that they get there. For example, Paul Christiano's “Teaching ML to answer questions honestly instead of predicting human answers,” which I quoted from previously, can very straightforwardly be thought of as an exercise in using inductive bias analysis to ensure a truthful question-answerer.
Additionally, more than just presenting a list of possible training goals and training rationales, I hope that these lists help open up the space of strategies for building safe advanced AI beyond those that have previously been proposed. This includes both novel ways to combine a training goal with a training rationale—e.g. what if you used inductive bias analysis to get a myopic agent, or AI cognitive science to get a narrow agent?—as well as gesturing at the general space of possible training goals and training rationales, which likely includes many more possibilities that we've yet to consider.
Training story sensitivity analysis
If we do start using training stories regularly for reasoning about AI projects, we’re going to have to grapple with what happens when training stories fail—because, as we’ve already seen with e.g. the cat detection training story from earlier, seemingly plausible training stories can and will fail. Ideally, we’d like it to always be the case that training stories fail safely: especially when it comes to particularly risky failure modes such as deceptive alignment, rather than risk getting a deceptive model, we’d much rather training just not work. Furthermore, if always failing safely is too difficult, we’ll need to have good guarantees regarding the degree to which a training story can fail and in what areas failure is most likely.
In all of these cases, I want to refer to this sort of work as training story sensitivity analysis. Sensitivity analysis in general is the study of how the uncertainty in the inputs to something affects its outputs. In the case of training stories, that means answering questions like “how sensitive is this training rationale to changes in its assumptions about the inductive biases of neural networks?” and “in the situations where the training story fails, how likely is it to fail safely vs. catastrophically?” There are lots of ways to start answering questions like this, but here are some examples of the sorts of ways in which we might be able to do training story sensitivity analysis:
Hopefully, as we build better training stories, we’ll also be able to build better tools for their sensitivity analysis so we can actually build real confidence in what sort of model our training processes will produce.
It’s worth noting that there are ways to potentially build advanced or transformative AI that don’t assume the emergency of agency (and in fact might rely on the opposite) such as the aforementioned Microscope AI or STEM AI. ↩︎
Obviously this isn’t fair because in neither of these cases was Paul trying to write a training goal; but nevertheless I think that the second example that I give is a really good example of what I think a training goal should look like. ↩︎
For example, instead of using transparency and interpretability tools, you might instead try to make use of AI cognitive science, as I discuss in the final section on “Exploring the landscape of possible training stories.” ↩︎
It’s worth noting that while guaranteeing a short horizon length might be quite helpful for preventing deception, a short horizon length alone isn’t necessarily enough to guarantee the absence of deception, since e.g. a model with a short horizon length might cooperate with future versions of itself in such a way that looks more like a model with a long horizon length. See “Open Problems with Myopia” for more detail here. ↩︎